May 10, 2021

Overview of presentation

  1. Introduction to COVID-19 World Vaccine Adverse Reactions Dataset
  2. Project workflow
  3. Project methods: important packages and verbs used
  4. Challenges and solutions - Load, Clean and Augment
  5. Visualizations
  6. Modeling
  7. Conclusion and discussion

COVID-19 World Vaccine Adverse Reactions

PATIENTS.CSV: Contains information about the individuals who received the vaccines

## # A tibble: 3 x 35
##   VAERS_ID RECVDATE  STATE AGE_YRS CAGE_YR CAGE_MO SEX   RPT_DATE   SYMPTOM_TEXT
##   <chr>    <chr>     <chr>   <dbl>   <dbl>   <dbl> <chr> <date>     <chr>       
## 1 0916600  01/01/20… TX         33      33      NA F     NA         "Right side…
## 2 0916601  01/01/20… CA         73      73      NA F     NA         "Approximat…
## 3 0916602  01/01/20… WA         23      23      NA F     NA         "About 15 m…
## # … with 26 more variables: DIED <chr>, DATEDIED <chr>, L_THREAT <chr>,
## #   ER_VISIT <chr>, HOSPITAL <chr>, HOSPDAYS <dbl>, X_STAY <chr>,
## #   DISABLE <chr>, RECOVD <chr>, VAX_DATE <chr>, ONSET_DATE <chr>,
## #   NUMDAYS <dbl>, LAB_DATA <chr>, V_ADMINBY <chr>, V_FUNDBY <chr>,
## #   OTHER_MEDS <chr>, CUR_ILL <chr>, HISTORY <chr>, PRIOR_VAX <chr>,
## #   SPLTTYPE <chr>, FORM_VERS <dbl>, TODAYS_DATE <chr>, BIRTH_DEFECT <chr>,
## #   OFC_VISIT <chr>, ER_ED_VISIT <chr>, ALLERGIES <chr>

COVID-19 World Vaccine Adverse Reactions

VACCINES.CSV: Contains information about the received vaccine

## # A tibble: 3 x 8
##   VAERS_ID VAX_TYPE VAX_MANU VAX_LOT VAX_DOSE_SERIES VAX_ROUTE VAX_SITE VAX_NAME
##   <chr>    <chr>    <chr>    <chr>   <chr>           <chr>     <chr>    <chr>   
## 1 0916600  COVID19  "MODERN… 037K20A 1               IM        LA       COVID19…
## 2 0916601  COVID19  "MODERN… 025L20A 1               IM        RA       COVID19…
## 3 0916602  COVID19  "PFIZER… EL1284  1               IM        LA       COVID19…

COVID-19 World Vaccine Adverse Reactions

SYMPTOMS.CSV: Contains information about the symptoms experienced after vaccination

## # A tibble: 3 x 11
##   VAERS_ID SYMPTOM1      SYMPTOMVERSION1 SYMPTOM2   SYMPTOMVERSION2 SYMPTOM3    
##   <chr>    <chr>                   <dbl> <chr>                <dbl> <chr>       
## 1 0916600  Dysphagia                23.1 Epiglotti…            23.1 <NA>        
## 2 0916601  Anxiety                  23.1 Dyspnoea              23.1 <NA>        
## 3 0916602  Chest discom…            23.1 Dysphagia             23.1 Pain in ext…
## # … with 5 more variables: SYMPTOMVERSION3 <dbl>, SYMPTOM4 <chr>,
## #   SYMPTOMVERSION4 <dbl>, SYMPTOM5 <chr>, SYMPTOMVERSION5 <dbl>

Project workflow

  1. Load data sets (patients, vaccines, symptoms)
  2. Clean each data set individually
  3. Augment and merge the data sets
  4. Make visualizations
  5. Do modeling

Project methods - Important packages and verbs

Load and clean

  • readr: read_csv(), write_csv()
  • dplyr: filter(), select(), distinct(), mutate()
  • tidyr: replace_na()

Augment

  • dplyr: filter(), select(), mutate(), case_when(), arrange(), group_by(), count(), distinct(), summarise(), rename(), inner_join(), full_join()
  • tidyr: pivot_longer(), pivot_wider(), drop_na()
  • purrr: pluck()
  • stringr: regular expressions, str_c(), str_replace()

Visualizations and modeling

  • ggplot2: geom_bar(), geom_boxplot(), geom_tile(), geom_segment(), theme_minimal()
  • forcats: fct_reorder()
  • scales
  • patchwork
  • viridis
  • stats: glm(), prcomp()
  • broom: tidy(), glance()
  • purrr: map()
  • tidyr: nest()
  • dplyr: ungroup()

01_load - Challenges and Solutions 1

CHALLENGE: Multiple large files

SOLUTION: Keep them compressed and only decompress when reading into R:
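
A minimal sketch of the idea, assuming gzip-compressed CSV files: readr's read_csv() decompresses paths ending in .gz on the fly, so no uncompressed copy has to be kept on disk. The file name and the two demo rows are hypothetical stand-ins for the real raw files.

```r
library(readr)

# Hypothetical stand-in for one of the large raw files
demo <- data.frame(
  VAERS_ID = c("0916600", "0916601"),
  STATE = c("TX", "CA")
)

# write_csv() compresses automatically when the path ends in .gz
gz_path <- file.path(tempdir(), "patients_demo.csv.gz")
write_csv(demo, gz_path)

# read_csv() decompresses while reading; the file on disk stays compressed
patients_demo <- read_csv(
  gz_path,
  col_types = cols(.default = col_character())
)
```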

01_load - Challenges and Solutions 2

CHALLENGE: Wrong column types automatically assigned by R

## Warning: 241 parsing failures.
##  row          col           expected     actual         file
## 1465 BIRTH_DEFECT 1/0/T/F/TRUE/FALSE Y          <connection>
## 2742 X_STAY       1/0/T/F/TRUE/FALSE Y          <connection>
## 2807 RPT_DATE     1/0/T/F/TRUE/FALSE 2021-01-04 <connection>
## 2807 V_FUNDBY     1/0/T/F/TRUE/FALSE OTH        <connection>
## 2811 RPT_DATE     1/0/T/F/TRUE/FALSE 2021-01-04 <connection>
## .... ............ .................. .......... ............
## See problems(...) for more details.

SOLUTION: Manually assign column types
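
The warning above expects 1/0/T/F/TRUE/FALSE, which suggests that columns that are mostly empty in the first rows were guessed as logical. A sketch of assigning the types by hand with col_types follows; the inline CSV text and the choice of columns are assumptions for illustration, not the project's full column specification.

```r
library(readr)

# Toy CSV text (a string containing newlines is read as literal data)
csv_text <- paste0(
  "VAERS_ID,AGE_YRS,RPT_DATE,BIRTH_DEFECT,X_STAY\n",
  "0916600,33,,,\n",
  "0916601,73,2021-01-04,Y,Y\n"
)

patients_demo <- read_csv(
  csv_text,
  col_types = cols(
    VAERS_ID     = col_character(),              # keeps the leading zero
    AGE_YRS      = col_double(),
    RPT_DATE     = col_date(format = "%Y-%m-%d"),
    BIRTH_DEFECT = col_character(),              # "Y" or missing, not logical
    X_STAY       = col_character()
  )
)
```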

01_load - Challenges and Solutions 3

CHALLENGE: NA strings (“NA”, “N/A”, “Unknown”, “ ”, …)

SOLUTION:
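
One way to handle this, sketched under the assumption that the strings are known up front: pass them to the na argument of read_csv() so they become real NA values at load time. The vector na_strings is a hypothetical, incomplete list.

```r
library(readr)

# Toy CSV text with several spellings of "missing"
csv_text <- paste0(
  "VAERS_ID,ALLERGIES,HISTORY\n",
  "0916600,N/A,Unknown\n",
  "0916601,Penicillin,NA\n",
  "0916602, ,Asthma\n"
)

# Assumed list of strings to treat as missing; extend as needed
na_strings <- c("", " ", "NA", "N/A", "Unknown")

patients_demo <- read_csv(
  csv_text,
  na = na_strings,
  col_types = cols(.default = col_character())
)
```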

02_clean

02_clean - Challenges and Solutions 1

CHALLENGE                                         SOLUTION
Unwanted columns                                  select(-c())
NAs that should be interpreted as “no”            replace_na()
Row duplications                                  distinct()
Individuals who got more than one vaccine type    add_count(VAERS_ID) %>% filter(n == 1) %>% select(-n)
(generates noise)
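
The four solutions can be chained into one pipeline. This is a sketch on a toy table: the IDs, lot numbers and the choice of VAX_LOT as the unwanted column are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy reports: "A" is an exact duplicate, "C" got two vaccine types
reports <- tibble(
  VAERS_ID = c("A", "A", "B", "C", "C"),
  VAX_MANU = c("MODERNA", "MODERNA", "JANSSEN", "MODERNA", "JANSSEN"),
  VAX_LOT  = c("x1", "x1", "y2", "z3", "z4"),
  DIED     = c(NA, NA, "Y", NA, NA)
)

cleaned <- reports %>%
  select(-c(VAX_LOT)) %>%                  # unwanted column
  mutate(DIED = replace_na(DIED, "N")) %>% # NA means "no" here
  distinct() %>%                           # drop row duplications
  add_count(VAERS_ID) %>%                  # rows per individual
  filter(n == 1) %>%                       # keep single-vaccine individuals
  select(-n)
```

Running distinct() before add_count() matters: otherwise the exact duplicate "A" would be counted twice and dropped as if it had received two vaccine types.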

03_augment

03_augment - Challenges and Solutions 1

CHALLENGE: Some columns contain long string descriptions that need to be turned into something tidy

SOLUTION: Make categorical variable

03_augment - Challenges and Solutions 1

Example: ALLERGIES column:

Make categorical variable that states if patient has allergies or not:
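
A minimal sketch of one way to do this with case_when() and a regular expression. The pattern no_allergy_pattern is an assumption and far from exhaustive; the rows are toy examples.

```r
library(dplyr)
library(stringr)

patients_demo <- tibble(
  VAERS_ID  = c("1", "2", "3", "4", "5"),
  ALLERGIES = c("Penicillin", NA, "none that I am aware of",
                "No known allergies", "Latex, shellfish")
)

# Assumed pattern: free text that starts with a "no allergies" phrase
no_allergy_pattern <- regex("^(no|none|nka|nkda)\\b", ignore_case = TRUE)

patients_demo <- patients_demo %>%
  mutate(HAS_ALLERGIES = case_when(
    is.na(ALLERGIES)                          ~ "N",  # missing means none
    str_detect(ALLERGIES, no_allergy_pattern) ~ "N",  # explicit "none"
    TRUE                                      ~ "Y"   # anything else
  ))
```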

Clean categorical HAS_ALLERGIES column:

## # A tibble: 5 x 3
##   VAERS_ID ALLERGIES                                               HAS_ALLERGIES
##   <chr>    <chr>                                                   <chr>        
## 1 0916603  Diclofenac, novacaine, lidocaine, pickles, tomatoes, m… Y            
## 2 0916604  <NA>                                                    N            
## 3 0916660  Penicillin                                              Y            
## 4 0916685  none that I am aware of                                 N            
## 5 0917437  No known allergies                                      N

03_augment - Challenges and Solutions 1

Another example: OTHER_MEDS column

Detect individuals that have taken anti-inflammatory or steroid drugs before vaccine (not recommended):
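
A sketch of the detection step using str_detect(). Both drug lists are short, hypothetical examples; a real analysis would need a much fuller list of names.

```r
library(dplyr)
library(stringr)

meds_demo <- tibble(
  VAERS_ID   = c("1", "2", "3", "4"),
  OTHER_MEDS = c("Aspirin 81 mg daily", "Ibuprofen as needed",
                 "Hydrocortisone cream", NA)
)

# Assumed, deliberately short drug lists
antiinflammatory_pattern <- regex("aspirin|ibuprofen|naproxen",
                                  ignore_case = TRUE)
steroid_pattern <- regex("prednisone|hydrocortisone|dexamethasone",
                         ignore_case = TRUE)

meds_demo <- meds_demo %>%
  mutate(
    # missing = "N": no medication text is treated as "does not take"
    TAKES_ANTIINFLAMMATORY = if_else(
      str_detect(OTHER_MEDS, antiinflammatory_pattern), "Y", "N", missing = "N"
    ),
    TAKES_STEROIDS = if_else(
      str_detect(OTHER_MEDS, steroid_pattern), "Y", "N", missing = "N"
    )
  )
```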

Clean, categorical TAKES_ANTIINFLAMMATORY and TAKES_STEROIDS columns:

## # A tibble: 4 x 4
##   VAERS_ID OTHER_MEDS                           TAKES_ANTIINFLAM… TAKES_STEROIDS
##   <chr>    <chr>                                <chr>             <chr>         
## 1 0918421  1 aspirin a day 81 mg, levothyroxin… Y                 N             
## 2 0921732  Ibuprofen - PRN  States she does no… Y                 N             
## 3 0932980  Hydrocortisone 25mg daily.  Fludroc… N                 Y             
## 4 0934539  Singulair, Oxybutynin, Fosamax, Pre… N                 Y

03_augment - Challenges and Solutions 2

CHALLENGE: Symptoms are recorded in a way that makes later analysis difficult

## # A tibble: 5 x 6
##   VAERS_ID SYMPTOM1           SYMPTOM2        SYMPTOM3 SYMPTOM4         SYMPTOM5
##   <chr>    <chr>              <chr>           <chr>    <chr>            <chr>   
## 1 0916618  Injection site pa… Pain            <NA>     <NA>             <NA>    
## 2 0916619  Injection site pa… Menorrhagia     <NA>     <NA>             <NA>    
## 3 0916620  Arthralgia         Chills          Headache Mobility decrea… Myalgia 
## 4 0916620  Nausea             Pain in extrem… Pyrexia  <NA>             <NA>    
## 5 0916621  Chills             Fatigue         Headache Myalgia          <NA>

SOLUTION: 20 most common symptoms are found and turned into TRUE/FALSE columns

## # A tibble: 3 x 21
##   VAERS_ID HEADACHE PYREXIA CHILLS FATIGUE PAIN  PAIN_IN_EXTREMITY NAUSEA
##   <chr>    <lgl>    <lgl>   <lgl>  <lgl>   <lgl> <lgl>             <lgl> 
## 1 0916600  FALSE    FALSE   FALSE  FALSE   FALSE FALSE             FALSE 
## 2 0916601  FALSE    FALSE   FALSE  FALSE   FALSE FALSE             FALSE 
## 3 0916602  FALSE    FALSE   FALSE  FALSE   FALSE TRUE              FALSE 
## # … with 13 more variables: DIZZINESS <lgl>, MYALGIA <lgl>,
## #   INJECTION_SITE_ERYTHEMA <lgl>, INJECTION_SITE_PRURITUS <lgl>,
## #   INJECTION_SITE_SWELLING <lgl>, INJECTION_SITE_PAIN <lgl>, ARTHRALGIA <lgl>,
## #   DYSPNOEA <lgl>, VOMITING <lgl>, PRURITUS <lgl>, DEATH <lgl>, RASH <lgl>,
## #   ASTHENIA <lgl>
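
A sketch of how such a table could be built: pivot the SYMPTOM columns to long format, find the most common symptoms, then pivot back to one logical column per symptom. The toy data and the cutoff n_top = 2 (20 in the slides) are assumptions.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy data: individual "A" spans two rows, as in the raw file
symptoms_demo <- tibble(
  VAERS_ID = c("A", "A", "B", "C"),
  SYMPTOM1 = c("Headache", "Pyrexia", "Headache", "Rash"),
  SYMPTOM2 = c("Chills", NA, "Chills", NA)
)
n_top <- 2  # hypothetical cutoff for this toy example

# One row per individual and symptom; matches() skips SYMPTOMVERSION columns
symptoms_long <- symptoms_demo %>%
  pivot_longer(cols = matches("^SYMPTOM[0-9]+$"),
               names_to = "slot", values_to = "symptom",
               values_drop_na = TRUE) %>%
  distinct(VAERS_ID, symptom)

top_symptoms <- symptoms_long %>%
  count(symptom, sort = TRUE) %>%
  slice_head(n = n_top) %>%
  pull(symptom)

wide_top <- symptoms_long %>%
  filter(symptom %in% top_symptoms) %>%
  mutate(symptom = str_to_upper(str_replace_all(symptom, " ", "_")),
         present = TRUE) %>%
  pivot_wider(names_from = symptom, values_from = present)

# Join back so individuals without any top symptom are kept as all FALSE
symptoms_wide <- symptoms_demo %>%
  distinct(VAERS_ID) %>%
  left_join(wide_top, by = "VAERS_ID") %>%
  mutate(across(-VAERS_ID, ~ replace_na(.x, FALSE)))
```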

04_analysis_visualizations

04_analysis_visualizations - Age, sex and vaccine manufacturer distribution

04_analysis_visualizations - Age distribution

04_analysis_visualizations - Age manufacturer distribution

04_analysis_visualizations - Sex and vaccine manufacturer distribution

Sex distribution

SEX       n
F     24070
M      8514
NA      828

Vaccine manufacturer distribution

VAX_MANU              n
JANSSEN            1106
MODERNA           16253
PFIZER-BIONTECH   16053

04_analysis_visualizations - Days until onset of symptoms vs. Age Group

04_analysis_visualizations - Age/sex vs. number of symptoms

04_analysis_visualizations - Vaccine manufacturer vs. number of symptoms

04_analysis_visualizations - Age vs. types of symptoms

04_analysis_visualizations - Sex vs. types of symptoms

04_analysis_visualizations - Vaccine manufacturer vs. types of symptoms

04_analysis_visualizations - Vaccine manufacturer vs. death

04_analysis_regressions

04_analysis_modeling - Logistic regression: death ~ patient profile

## # A tibble: 7 x 6
##   term           estimate std.error statistic  p.value odds_ratio
##   <chr>             <dbl>     <dbl>     <dbl>    <dbl>      <dbl>
## 1 (Intercept)    -9.34      0.161    -58.0    0         0.0000876
## 2 SEXM            0.924     0.0573    16.1    2.18e-58  2.52     
## 3 AGE_YRS         0.0915    0.00207   44.2    0         1.10     
## 4 HAS_ALLERGIESY -0.100     0.0608    -1.65   9.82e- 2  0.904    
## 5 HAS_ILLNESSY    1.10      0.0664    16.6    6.60e-62  3.01     
## 6 HAS_COVIDY     -0.117     0.148     -0.791  4.29e- 1  0.890    
## 7 HAD_COVIDY      0.00915   0.193      0.0474 9.62e- 1  1.01
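
The odds_ratio column is exp(estimate): for example, exp(0.0915) is about 1.10 for AGE_YRS and exp(0.924) is about 2.52 for SEXM. A sketch of how a table of this shape can be produced with glm() and broom::tidy() follows; the data are simulated and only two predictors are used, so the numbers will not match the slide.

```r
library(dplyr)
library(broom)

set.seed(1)
n_reports <- 500

# Simulated stand-in for the patient table (not the real data)
patients_demo <- tibble(
  SEX     = sample(c("F", "M"), n_reports, replace = TRUE),
  AGE_YRS = round(runif(n_reports, min = 18, max = 95))
) %>%
  mutate(DIED = rbinom(n_reports, 1,
                       plogis(-6 + 0.06 * AGE_YRS + 0.8 * (SEX == "M"))))

# Logistic regression: death ~ patient profile
fit <- glm(DIED ~ SEX + AGE_YRS,
           family = binomial(link = "logit"),
           data = patients_demo)

# Coefficients are log-odds; exponentiate to get odds ratios
death_model <- tidy(fit) %>%
  mutate(odds_ratio = exp(estimate))
```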

04_analysis_modeling - Logistic regression: death ~ symptoms

## # A tibble: 20 x 6
##    term                        estimate std.error statistic  p.value  odds_ratio
##    <chr>                          <dbl>     <dbl>     <dbl>    <dbl>       <dbl>
##  1 (Intercept)                   -2.01     0.0287  -70.1    0        0.134      
##  2 HEADACHETRUE                  -1.67     0.156   -10.7    7.92e-27 0.188      
##  3 PYREXIATRUE                   -0.429    0.112    -3.82   1.34e- 4 0.651      
##  4 CHILLSTRUE                    -1.21     0.171    -7.11   1.17e-12 0.298      
##  5 FATIGUETRUE                   -0.367    0.115    -3.19   1.41e- 3 0.693      
##  6 PAINTRUE                      -0.913    0.153    -5.98   2.17e- 9 0.401      
##  7 NAUSEATRUE                    -0.621    0.139    -4.46   8.17e- 6 0.538      
##  8 DIZZINESSTRUE                 -2.17     0.193   -11.2    2.87e-29 0.114      
##  9 PAIN_IN_EXTREMITYTRUE         -1.43     0.194    -7.38   1.56e-13 0.239      
## 10 MYALGIATRUE                   -1.57     0.264    -5.94   2.91e- 9 0.209      
## 11 INJECTION_SITE_PAINTRUE       -1.37     0.248    -5.54   2.95e- 8 0.253      
## 12 INJECTION_SITE_ERYTHEMATRUE  -15.1    186.       -0.0811 9.35e- 1 0.000000285
## 13 ARTHRALGIATRUE                -1.68     0.338    -4.97   6.73e- 7 0.186      
## 14 DYSPNOEATRUE                   0.509    0.0845    6.02   1.73e- 9 1.66       
## 15 VOMITINGTRUE                   0.677    0.135     5.02   5.06e- 7 1.97       
## 16 PRURITUSTRUE                  -3.67     0.579    -6.33   2.39e-10 0.0256     
## 17 INJECTION_SITE_SWELLINGTRUE  -14.4    206.       -0.0696 9.44e- 1 0.000000580
## 18 RASHTRUE                      -2.62     0.356    -7.36   1.91e-13 0.0728     
## 19 ASTHENIATRUE                   0.442    0.122     3.62   2.94e- 4 1.56       
## 20 INJECTION_SITE_PRURITUSTRUE  -14.5    227.       -0.0639 9.49e- 1 0.000000509
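
The three injection-site terms with estimates near -15 and standard errors around 200 are consistent with (quasi-)complete separation, which typically arises when no deaths occur among reports with that symptom. Whether that is the case here is an assumption; a sketch of how one might check it with a simple cross-count, on toy data:

```r
library(dplyr)

# Toy data: nobody with the injection-site symptom died
symptoms_demo <- tibble(
  DEATH                   = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  INJECTION_SITE_ERYTHEMA = c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Cell counts of symptom by outcome; a missing TRUE/TRUE cell is the red flag
cell_counts <- symptoms_demo %>%
  count(INJECTION_SITE_ERYTHEMA, DEATH)

# TRUE when no report has both the symptom and death
no_overlap <- !any(symptoms_demo$INJECTION_SITE_ERYTHEMA & symptoms_demo$DEATH)
```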

04_analysis_modeling - Many logistic regressions: symptoms ~ takes anti-inflammatory

## # A tibble: 20 x 9
##    SYMPTOM   estimate std.error statistic  p.value conf.low conf.high odds_ratio
##    <chr>        <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>      <dbl>
##  1 HEADACHE   -0.170     0.0954    -1.79  7.42e- 2  -0.361    0.0133       0.843
##  2 PYREXIA     0.0734    0.0967     0.760 4.48e- 1  -0.120    0.259        1.08 
##  3 CHILLS     -0.0727    0.103     -0.703 4.82e- 1  -0.280    0.126        0.930
##  4 FATIGUE     0.0226    0.102      0.221 8.25e- 1  -0.183    0.219        1.02 
##  5 PAIN        0.0190    0.106      0.179 8.58e- 1  -0.194    0.222        1.02 
##  6 NAUSEA     -0.0574    0.116     -0.495 6.21e- 1  -0.291    0.164        0.944
##  7 DIZZINESS  -0.187     0.132     -1.42  1.57e- 1  -0.455    0.0633       0.829
##  8 PAIN_IN_…  -0.0720    0.133     -0.541 5.89e- 1  -0.342    0.180        0.931
##  9 MYALGIA    -0.167     0.143     -1.17  2.42e- 1  -0.458    0.102        0.846
## 10 INJECTIO…   0.0938    0.131      0.718 4.73e- 1  -0.171    0.342        1.10 
## 11 INJECTIO…   0.0935    0.144      0.647 5.17e- 1  -0.201    0.366        1.10 
## 12 ARTHRALG…   0.228     0.141      1.62  1.06e- 1  -0.0589   0.495        1.26 
## 13 DYSPNOEA    0.325     0.137      2.38  1.75e- 2   0.0470   0.584        1.38 
## 14 VOMITING    0.140     0.159      0.880 3.79e- 1  -0.186    0.439        1.15 
## 15 PRURITUS    0.0229    0.166      0.138 8.90e- 1  -0.319    0.335        1.02 
## 16 INJECTIO…  -0.364     0.201     -1.81  7.01e- 2  -0.784    0.00822      0.695
## 17 DEATH       0.801     0.120      6.69  2.28e-11   0.559    1.03         2.23 
## 18 RASH       -0.0669    0.180     -0.372 7.10e- 1  -0.439    0.269        0.935
## 19 ASTHENIA    0.478     0.148      3.24  1.21e- 3   0.177    0.757        1.61 
## 20 INJECTIO…   0.424     0.156      2.71  6.75e- 3   0.103    0.718        1.53 
## # … with 1 more variable: identified_as <chr>
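
A sketch of the "many models" pattern that yields one row per symptom: pivot to long format, nest() by symptom, fit one glm() per nested table with map(), and tidy the results. The data are simulated with two symptoms only, and the identified_as column from the slide is not reproduced.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(2)
n_reports <- 300

# Simulated stand-in; symptoms coded 0/1 here instead of TRUE/FALSE
demo <- tibble(
  TAKES_ANTIINFLAMMATORY = sample(c("N", "Y"), n_reports, replace = TRUE),
  HEADACHE = rbinom(n_reports, 1, 0.4),
  PYREXIA  = rbinom(n_reports, 1, 0.3)
)

models <- demo %>%
  pivot_longer(cols = c(HEADACHE, PYREXIA),
               names_to = "SYMPTOM", values_to = "present") %>%
  group_by(SYMPTOM) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    # One logistic regression per symptom
    fit = map(data, ~ glm(present ~ TAKES_ANTIINFLAMMATORY,
                          family = binomial(link = "logit"),
                          data = .x)),
    tidied = map(fit, ~ tidy(.x, conf.int = TRUE))
  ) %>%
  unnest(tidied) %>%
  filter(term == "TAKES_ANTIINFLAMMATORYY") %>%  # drop the intercept rows
  mutate(odds_ratio = exp(estimate)) %>%
  select(SYMPTOM, estimate, std.error, statistic, p.value,
         conf.low, conf.high, odds_ratio)
```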

04_analysis_tests

04_analysis_tests - Important tools used

Important verbs and tools used:

  • prop_test()
  • chisq.test()
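
A sketch of how the two kinds of test are called on a 2 x 2 table. The counts are hypothetical, and base R's prop.test() is used as a stand-in because the slides do not say which package prop_test() comes from.

```r
# Hypothetical counts: deaths and survivors for two manufacturers
counts <- matrix(
  c(30, 970,
    45, 955),
  nrow = 2, byrow = TRUE,
  dimnames = list(manufacturer = c("A", "B"),
                  outcome = c("died", "survived"))
)

# Chi-squared test of independence between manufacturer and outcome
chi_result <- chisq.test(counts)

# Two-sample test of equal death proportions (base R stand-in)
prop_result <- prop.test(x = counts[, "died"], n = rowSums(counts))
```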

04_analysis_clustering

04_analysis_clustering - Important tools used

Important verbs and tools used:

  • prcomp()
  • kmeans()
  • tidymodels
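
A sketch of the two base R calls on a toy 0/1 symptom matrix. Which variables went into the project's PCA, and the number of clusters, are not stated in the slides, so both are assumptions here.

```r
set.seed(3)

# Toy symptom matrix: 200 reports, 5 symptoms coded 0/1
symptom_matrix <- matrix(
  rbinom(200 * 5, 1, 0.3), ncol = 5,
  dimnames = list(NULL, c("HEADACHE", "PYREXIA", "CHILLS", "FATIGUE", "PAIN"))
)

# PCA on centered and scaled columns
pca <- prcomp(symptom_matrix, center = TRUE, scale. = TRUE)

# Share of variance per component (the quantity shown in a scree plot)
variance_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Loadings of each symptom on each component (the rotation matrix)
rotation <- pca$rotation

# k-means on the first two principal components; 3 clusters is an assumption
clusters <- kmeans(pca$x[, 1:2], centers = 3, nstart = 10)
```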

04_analysis_clustering - PCA biplot

04_analysis_clustering - Rotation matrix

04_analysis_clustering - Scree plot

Conclusion and discussion